CAB: An Energy-Based Speaker Clustering Model for Rapid Adaptation in Non-Parallel Voice Conversion

نویسنده

  • Toru Nakashika
چکیده

In this paper, a new energy-based probabilistic model, called CAB (Cluster Adaptive restricted Boltzmann machine), is proposed for voice conversion (VC) that does not require parallel data during the training and requires only a small amount of speech data during the adaptation. Most of the existing VC methods require parallel data for training. Recently, VC methods that do not require parallel data (called non-parallel VCs) have been also proposed and are attracting much attention because they do not require prepared or recorded parallel speech data, unlike conventional approaches. The proposed CAB model is aimed at statistical non-parallel VC based on cluster adaptive training (CAT). This extends the VC method used in our previous model, ARBM (adaptive restricted Boltzmann machine). The ARBM approach assumes that any speech signals can be decomposed into speaker-invariant phonetic information and speaker-identity information using the ARBM adaptation matrices of each speaker. VC is achieved by switching the source speaker’s identity into those of the target speaker while retaining the phonetic information obtained by decomposition of the source speaker’s speech. In contrast, CAB speaker identities are represented as cluster vectors that determine the adaptation matrices. As the number of clusters is generally smaller than the number of speakers, the number of model parameters can be reduced compared to ARBM, which enables rapid adaptation of a new speaker. Our experimental results show that the proposed method especially performed better than the ARBM approach, particularly in adaptation.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

طراحی یک روش آموزش ناموازی جدید برای تبدیل گفتار با عملکردی بهتر از آموزش موازی

Introduction: The art of voice mimicking by computers, has with the computer have been one of the most challenging topics of speech processing in recent years. The system of voice conversion has two sides. In one side, the speaker is the source that his or her voice has been changed for mimicking the target speaker’s voice (which is on the other side). Two methods of p...

متن کامل

Map-based adaptation for speech conversion using adaptation data selection and non-parallel training

This study presents an approach to GMM-based speech conversion using maximum a posteriori probability (MAP) adaptation. First, a conversion function is trained using a parallel corpus containing the same utterances spoken by both the source and the reference speakers. Then a non-parallel corpus from a new target speaker is used for the adaptation of the conversion function which models the voic...

متن کامل

Mapping frames with DNN-HMM recognizer for non-parallel voice conversion

To convert one speaker’s voice to another’s, the mapping of the corresponding speech segments from source speaker to target speaker must be obtained first. In parallel voice conversion, normally dynamic time warping (DTW) method is used to align signals of source and target voices. However, for conversion between non-parallel speech data, the DTW based mapping method does not work. In this pape...

متن کامل

Speaker-independent HMM-based voice conversion using quantized fundamental frequency

This paper proposes a segment-based voice conversion technique between arbitrary speakers with a small amount of training data. In the proposed technique, an input speech utterance of source speaker is decoded into phonetic and prosodic symbol sequences, and then the converted speech is generated from the pre-trained target speaker’s HMM using the decoded information. To reduce the required amo...

متن کامل

Speaker-adaptive-trainable Boltzmann machine and its application to non-parallel voice conversion

In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. Voice conversion is a technique where only speaker-specific information in the source speech is converted while keeping the phonological information unchanged. Most of the existing VC methods rely on parallel data—pairs of speech data from the source and target speakers utterin...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017